The independent variables include the following:
- Number of times pregnant
- Plasma glucose concentration after 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Body mass index (weight in kg/(height in m)^2)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Diabetes pedigree function
- Age (years)
  Pregnancies         Glucose        BloodPressure    SkinThickness
 Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00
 1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00
 Median : 3.000   Median :117.00   Median : 72.00   Median :29.00
 Mean   : 3.845   Mean   :121.69   Mean   : 72.39   Mean   :29.11
 3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00

    Insulin           BMI        DiabetesPedigreeFunction      Age
 Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00
 1st Qu.:121.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00
 Median :156.0   Median :32.00   Median :0.3725           Median :29.00
 Mean   :155.8   Mean   :32.45   Mean   :0.4719           Mean   :33.24
 3rd Qu.:156.0   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00

 Outcome
 0:500
 1:268
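The Outcome counts show that the two classes are not balanced. As a quick worked check, the class shares can be computed directly from the two counts; the vector names below are only illustrative and do not appear in the original analysis.

```r
# Outcome counts copied from the summary above
outcome_counts = c("0" = 500, "1" = 268)

# Share of each class among all observations
outcome_prop = outcome_counts / sum(outcome_counts)
round(outcome_prop, 3)
```

Roughly 35% of the observations are positive, so a classifier that always predicts 0 would already be right about 65% of the time. That is a useful baseline when judging the logistic regression.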
EDA: distribution
Code
library(GGally)
ggpairs(df3)  # scatterplot matrix of all variable pairs
EDA: correlation coefficients
Code
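library(ggcorrplot)
r = cor(df3[, -9])          # correlations among the predictors (column 9 is Outcome)
pmat = cor_pmat(df3[, -9])  # p-values are computed from the data, not from r
ggcorrplot(r, hc.order = TRUE, type = "lower", lab = TRUE, p.mat = pmat)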
Fit logistic regression
Code
lr_fit = glm(Outcome ~ ., data = df3, family = "binomial")
summary(lr_fit)
Call:
glm(formula = Outcome ~ ., family = "binomial", data = df3)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6712  -0.7181  -0.3953   0.7124   2.3761

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)              -9.0899172  0.8121027 -11.193  < 2e-16 ***
Pregnancies               0.1251221  0.0323909   3.863 0.000112 ***
Glucose                   0.0373449  0.0038787   9.628  < 2e-16 ***
BloodPressure            -0.0089408  0.0085602  -1.044 0.296272
SkinThickness             0.0032461  0.0131290   0.247 0.804715
Insulin                  -0.0007828  0.0011740  -0.667 0.504897
BMI                       0.0933558  0.0178437   5.232 1.68e-07 ***
DiabetesPedigreeFunction  0.8663243  0.2963963   2.923 0.003468 **
Age                       0.0132023  0.0095108   1.388 0.165097
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 713.23 on 759 degrees of freedom
AIC: 731.23
Number of Fisher Scoring iterations: 5
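The coefficients above are on the log-odds scale, which is hard to read directly. The sketch below copies a few numbers from the printed summary (no data set is needed) and shows how they relate to odds ratios, to the overall likelihood-ratio test, and to the reported AIC. The variable names are illustrative assumptions.

```r
# Estimates copied from the coefficient table above (significant terms only)
beta = c(Pregnancies = 0.1251221, Glucose = 0.0373449,
         BMI = 0.0933558, DiabetesPedigreeFunction = 0.8663243)

# Exponentiating a log-odds coefficient gives the odds ratio
# for a one-unit increase in that predictor
odds_ratio = exp(beta)
round(odds_ratio, 3)

# Likelihood-ratio test of the fitted model against the intercept-only model
dev_drop = 993.48 - 713.23   # null deviance minus residual deviance
df_drop  = 767 - 759         # one degree of freedom per predictor
p_value  = pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients)
aic_check = 713.23 + 2 * 9
aic_check
```

For example, the Glucose odds ratio is about 1.04: each additional unit of plasma glucose multiplies the odds of a positive outcome by roughly 1.04, holding the other predictors fixed. The AIC check reproduces the 731.23 printed in the summary.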
Stepwise feature selection
Goal: find the subset of variables that gives the best-performing model (lowest AIC), using one of:
- forward selection
- backward elimination
- stepwise selection (mixed)
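The three search strategies differ only in the starting model and in which moves are allowed at each step. The sketch below runs all three with `stepAIC()` on simulated stand-in data; the data frame `sim` and its columns are hypothetical and are not the diabetes data.

```r
library(MASS)  # provides stepAIC()

# Simulated stand-in data (hypothetical): only x1 and x2 affect y
set.seed(1)
n = 300
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y = rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

null_fit = glm(y ~ 1, data = sim, family = "binomial")  # intercept only
full_fit = glm(y ~ ., data = sim, family = "binomial")  # all predictors

# Smallest and largest models the search may visit
search_scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4)

# Forward selection: start empty, add the term that lowers AIC the most
fwd = stepAIC(null_fit, scope = search_scope, direction = "forward", trace = FALSE)

# Backward elimination: start full, drop the term whose removal lowers AIC the most
bwd = stepAIC(full_fit, direction = "backward", trace = FALSE)

# Mixed: at every step consider both adding and dropping a term
mixed = stepAIC(null_fit, scope = search_scope, direction = "both", trace = FALSE)

formula(fwd)
formula(bwd)
formula(mixed)
```

Each search stops when no single allowed move lowers the AIC further, so the three strategies can end at different models on the same data.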
Code
library(MASS)
step_model = stepAIC(lr_fit, direction = "both", trace = FALSE)
summary(step_model)
Call:
glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
family = "binomial", data = df3)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8235  -0.7243  -0.3992   0.7253   2.4350

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -9.189830   0.705765 -13.021  < 2e-16 ***
Pregnancies               0.143397   0.027547   5.206 1.93e-07 ***
Glucose                   0.036917   0.003491  10.576  < 2e-16 ***
BMI                       0.088693   0.014726   6.023 1.72e-09 ***
DiabetesPedigreeFunction  0.882700   0.294785   2.994  0.00275 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 716.13 on 763 degrees of freedom
AIC: 726.13
Number of Fisher Scoring iterations: 5
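To see what the selection step bought, the reduced model can be compared with the full model using only the numbers printed in the two summaries. This is a minimal sketch with illustrative variable names.

```r
# Values copied from the two model summaries above
full    = c(deviance = 713.23, df = 759, aic = 731.23)
reduced = c(deviance = 716.13, df = 763, aic = 726.13)

# AIC improvement from dropping the four non-significant predictors
aic_gain = full[["aic"]] - reduced[["aic"]]

# Likelihood-ratio test: does removing those four predictors worsen the fit?
lr_stat = reduced[["deviance"]] - full[["deviance"]]
lr_df   = reduced[["df"]] - full[["df"]]
lr_p    = pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(c(aic_gain = aic_gain, lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p), 3)
```

The deviance rises by only 2.90 on 4 degrees of freedom (p of about 0.57), so the simpler model fits essentially as well while lowering the AIC by 5.10. With the fitted objects available, `anova(step_model, lr_fit, test = "Chisq")` carries out the same test.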